Functions are an extremely important way to write concise code whenever we need to repeat the same “set of operations” multiple times.
Example:
fahrenheit_to_celsius <- function(temp_F){
temp_C <- (temp_F - 32) * 5 / 9
return(temp_C)
}
temp <- fahrenheit_to_celsius(32)
temp
[1] 0
We can compose different functions, passing the output of one as the input of another.
celsius_to_kelvin <- function(temp_C){
temp_K <- temp_C + 273.15
return(temp_K)
}
The conversion from Fahrenheit to Kelvin can be obtained via
celsius_to_kelvin(fahrenheit_to_celsius(100))
[1] 310.9278
You can set default values for the arguments as follows:
mySum <- function(num1, num2 = 10){
out <- num1 + num2
return(out)
}
mySum(2)
[1] 12
mySum(2, 5)
[1] 7
You can return multiple outputs by wrapping them in a list:
my_operations <- function(num1, num2) {
out <- num1 + num2
out1 <- num1 * num2
return(list('sum' = out, 'prod' = out1))
}
res <- my_operations(2, 5)
res$sum
[1] 7
res$prod
[1] 10
Remember the scoping rules: variables defined within a function are not visible outside of it. Conversely, the body of a function can see the variables that are outside of it.
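A minimal sketch illustrating both rules (the object names here are purely illustrative):
z <- 10                      # defined in the global environment
add_z <- function(x){
  w <- x + z                 # the body can read 'z' from the enclosing environment
  return(w)
}
add_z(5)                     # returns 15
exists('w')                  # FALSE: 'w' only lives inside the function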
Let us generate some data from a linear model.
set.seed(1234)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100) # rnorm(100) is the additive noise
plot(x, y, xlab = 'x', ylab = 'y', pch = 16)
Simple linear fit: we use the lm() function.
fit <- lm(y ~ x) # y = \beta_0 + \beta_1 * x
summary(fit)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.88626 -0.61401 0.00236 0.58645 2.98774
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0372 0.1050 9.88 <2e-16 ***
x 1.9739 0.1038 19.02 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.037 on 98 degrees of freedom
Multiple R-squared: 0.7869, Adjusted R-squared: 0.7847
F-statistic: 361.8 on 1 and 98 DF, p-value: < 2.2e-16
Remember how to interpret this table: each row gives a coefficient estimate, its standard error, the t statistic for testing whether that coefficient equals zero, and the corresponding p-value; the last lines report the residual standard error, the R-squared, and the overall F test.
There are (at least) two ways to plot the regression line: (1) use predict() on a grid of x values, or (2) use abline() and input the regression coefficients. Method 1:
xgrid <- seq(min(x), max(x), length.out = 100)
beta_hat <- coef(fit)
plot(x, y, xlab = 'x', ylab = 'y', pch = 16)
lines(xgrid, as.numeric(predict(fit, newdata = list(x = xgrid))), col = cols[1], lwd = 3)
points(mean(x), mean(y), col = cols[2], cex = 1.5, pch = 16)
Method 2:
plot(x, y, xlab = 'x', ylab = 'y', pch = 16)
abline(coef(fit), col = cols[1], lwd = 3)
points(mean(x), mean(y), col = cols[2], cex = 1.5, pch = 16)
The advantage of method 1 is that it generalizes to any statistical model.
Let’s take a look at the values of \(\hat{\beta}_0, \hat{\beta}_1\). We could have just computed them by hand, right? Remember that \[\hat{\beta}_1 = r_{xy} \frac{s_y}{s_x}, \quad \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}.\]
my_beta_hat <- c(mean(y) - cor(x, y) * sd(y)/sd(x) * mean(x), cor(x, y) * sd(y)/sd(x))
print(cbind(beta_hat, my_beta_hat))
beta_hat my_beta_hat
(Intercept) 1.037154 1.037154
x 1.973915 1.973915
What happens if we replicate the data generating mechanism?
B <- 1000
betas <- array(NA, dim = c(B, 2))
plot(x, y, xlab = 'x', ylab = 'y', pch = 16)
for (b in 1:B){
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)
betas[b,] <- coef(fit)
abline(fit, col = alpha(cols[1], 0.05))
}
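The stored coefficients give a Monte Carlo picture of the sampling variability of \(\hat{\beta}_0\) and \(\hat{\beta}_1\); a quick sketch of how we could summarize them:
colMeans(betas)         # Monte Carlo means, should be close to the true values (1, 2)
apply(betas, 2, sd)     # Monte Carlo standard errors of the two estimates
par(mfrow = c(1, 2))
hist(betas[, 1], main = 'Intercept', xlab = expression(hat(beta)[0]))
hist(betas[, 2], main = 'Slope', xlab = expression(hat(beta)[1]))
par(mfrow = c(1, 1))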
More complicated models:
y ~ x # y = beta_0 + beta_1 * x
y ~ 0 + x # y = beta_1 * x (no intercept)
y ~ x1 + x2 # y = beta_0 + beta_1 * x1 + beta_2 * x2
y ~ x1 * x2 # y = beta_0 + beta_1 * x1 + beta_2 * x2 + beta_3 * x1 * x2
y ~ I(x / 2) # y = beta_0 + beta_1 * (x / 2)
Note that I() evaluates its argument as a regular R expression. Also, you should know that the lm() function is not doing anything fancy: it simply solves the normal equations efficiently and reports a number of test statistics. In principle, you could code your own LinearModel function. You just have to remember that
\[\widehat{\boldsymbol{\beta}} = (X^{\intercal} X)^{-1} X^{\intercal} \mathbf{y},\]
which is the multivariate equivalent of the formulas that we have seen in class.
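As an illustration, here is a minimal LinearModel sketch based on the normal equations (the function name and its output format are our own choice, not an existing R function):
LinearModel <- function(y, X){
  X <- cbind(1, X)                                  # add the intercept column
  beta_hat <- solve(crossprod(X), crossprod(X, y))  # solves (X'X) beta = X'y
  res <- y - X %*% beta_hat
  return(list('coefficients' = as.numeric(beta_hat), 'residuals' = as.numeric(res)))
}
LinearModel(y, x)$coefficients   # should match coef(lm(y ~ x))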
The goal is to use the Boston dataset to predict the median house value using 13 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status). We start with a univariate regression in which the only covariate is lstat. As usual, let us begin by loading the libraries that we will use.
rm(list=ls())
setwd("~/Desktop/SDS323 - Fa20/Laboratories/Lab 2 - Regression/")
library(ggplot2)
library(latex2exp)
library(RColorBrewer)
library(ISLR)
library(MASS)
library(car)
library(plotrix)
theme_set(theme_bw(base_size = 14))
cols <- brewer.pal(9, "Set1")
names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
?Boston # get information on the data set
fit1 <- lm(medv ~ lstat, data = Boston)
summary(fit1) # summary of the linear model fit is more useful
Call:
lm(formula = medv ~ lstat, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
coef(fit1)
(Intercept) lstat
34.5538409 -0.9500494
confint(fit1) # CI for the intercept and slope
2.5 % 97.5 %
(Intercept) 33.448457 35.6592247
lstat -1.026148 -0.8739505
par(mar=c(4,4,2,2))
plotCI(x = 1:2, y = coef(fit1), ui = confint(fit1)[,2], li = confint(fit1)[,1],
lwd = 2, xaxt = 'n', xlab = '', ylab = 'value', pch = 16)
axis(1, at = 1:2, labels = c(TeX("$\\beta_0$"), TeX("$\\beta_1$")))
abline(h = 0, lwd = 2, lty = 2, col = cols[1])
We can make predictions at new x values using a CI or a PI. Remember the difference? The CI quantifies the uncertainty about the mean response \(\mathbb{E}[Y \mid X = x]\), while the PI also accounts for the variability of a single new observation, so it is wider.
predict(fit1, data.frame(lstat = c(5,10,15)), interval = "confidence")
fit lwr upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461
predict(fit1, data.frame(lstat = c(5,10,15)), interval = "prediction")
fit lwr upr
1 29.80359 17.565675 42.04151
2 25.05335 12.827626 37.27907
3 20.30310 8.077742 32.52846
Let us show the CI for the regression line, i.e. for the quantity \(\mathbb{E}[Y \mid X]\)
xgrid <- seq(min(Boston$lstat), max(Boston$lstat), length.out = 100)
plot(Boston$lstat, Boston$medv, pch = 16, xlab = 'lstat', ylab = 'medv') # scatter plot of medv and lstat
abline(fit1, col = cols[1], lwd = 3) # add the regression line in the scatter plot
y_hat <- predict(fit1, data.frame(lstat = xgrid), interval = "confidence")
polygon(c(xgrid,rev(xgrid)), c(y_hat[,'lwr'],rev(y_hat[,'upr'])),
col = alpha('gray', 0.5), border = alpha('gray', 0.5))
It is quite different from the PI for a new observation, i.e. for the quantity \(Y^\star\), which is much wider.
plot(Boston$lstat, Boston$medv, pch = 16, xlab = 'lstat', ylab = 'medv')
abline(fit1, col = cols[1], lwd = 3)
y_hat <- predict(fit1, data.frame(lstat = xgrid), interval = "prediction")
polygon(c(xgrid,rev(xgrid)), c(y_hat[,'lwr'],rev(y_hat[,'upr'])),
col = alpha('gray', 0.5), border = alpha('gray', 0.5))
We have already noted that some data points seem to be problematic.
par(mfrow = c(2,2))
plot(fit1) # make residual plot
par(mfrow = c(1,1))
plot(hatvalues(fit1), pch = 16) # hat value is the measure for leverage
abline(h = length(coef(fit1))/nrow(Boston), lwd = 2, lty = 2, col = cols[1])
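For instance, a quick sketch of how one could flag the points with the largest leverage:
which.max(hatvalues(fit1))                      # index of the observation with the largest leverage
sort(hatvalues(fit1), decreasing = TRUE)[1:5]   # the five largest hat values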
Let us see how the fitted values compare to the observed ones
y_hat <- as.numeric(predict(fit1, data.frame(lstat = Boston$lstat)))
plot(Boston$medv, y_hat, pch = 16, xlab = 'observed', ylab = 'fitted')
abline(a = 0, b = 1, lty = 2, lwd = 2, col = cols[1])
The MSE (mean squared error) is
mean((y_hat - Boston$medv)^2)
[1] 38.48297
How would you compute the RSS?
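One way, as a sketch: the RSS is simply the sum of squared residuals, so
sum((Boston$medv - y_hat)^2)   # residual sum of squares
sum(residuals(fit1)^2)         # equivalent, using the residuals stored in the fit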
Let us now use multiple predictors! Let’s first take a look at the data.
pairs(Boston)
This does not help much, but hopefully the regression model can shed light on some of these relationships. Before starting, we can take a look at the correlations in the data.
library(ggcorrplot)
ggcorrplot(round(cor(Boston), 2), hc.order = TRUE, type = "lower",
lab = TRUE)
In the multivariate case, there are two equivalent ways of writing a linear model with lm():
1. pass the data frame (data = Boston) and use the regular formula syntax \(Y \sim X_1 + \dots + X_p\);
2. build the design matrix explicitly with model.matrix() and pass it to lm.fit().
fit2 <- lm(medv ~ lstat + age, data = Boston) # same lm command for simple or multiple linear regression
summary(fit2)
Call:
lm(formula = medv ~ lstat + age, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.981 -3.978 -1.283 1.968 23.158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
lstat -1.03207 0.04819 -21.416 < 2e-16 ***
age 0.03454 0.01223 2.826 0.00491 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
X <- model.matrix(medv ~ lstat + age, data = Boston)
head(X)
(Intercept) lstat age
1 1 4.98 65.2
2 1 9.14 78.9
3 1 4.03 61.1
4 1 2.94 45.8
5 1 5.33 54.2
6 1 5.21 58.7
Y <- Boston$medv
fit2 <- lm.fit(X, Y)
coef(fit2)
(Intercept) lstat age
33.22276053 -1.03206856 0.03454434
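As a quick sanity check (a sketch), the same coefficients can be obtained directly from the normal equations introduced above:
as.numeric(solve(crossprod(X), crossprod(X, Y)))   # (X'X)^{-1} X'Y, same values as coef(fit2)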
fit3 <- lm(medv ~ ., Boston) # . means all variables in the data set except the response
summary(fit3)
Call:
lm(formula = medv ~ ., data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit3)
fit4 <- lm(medv ~ .- age, Boston) # the "-" means "do not include a particular variable"
summary(fit4)
Call:
lm(formula = medv ~ . - age, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.6054 -2.7313 -0.5188 1.7601 26.2243
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
crim -0.108006 0.032832 -3.290 0.001075 **
zn 0.046334 0.013613 3.404 0.000719 ***
indus 0.020562 0.061433 0.335 0.737989
chas 2.689026 0.859598 3.128 0.001863 **
nox -17.713540 3.679308 -4.814 1.97e-06 ***
rm 3.814394 0.408480 9.338 < 2e-16 ***
dis -1.478612 0.190611 -7.757 5.03e-14 ***
rad 0.305786 0.066089 4.627 4.75e-06 ***
tax -0.012329 0.003755 -3.283 0.001099 **
ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
black 0.009321 0.002678 3.481 0.000544 ***
lstat -0.523852 0.047625 -10.999 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
Let us see how the fitted values compare to the observed ones
y_hat <- as.numeric(predict(fit3, newdata = Boston))
plot(Boston$medv, y_hat, pch = 16, xlab = 'observed', ylab = 'fitted')
abline(a = 0, b = 1, lty = 2, lwd = 2, col = cols[1])
The MSE (mean squared error) is
mean((y_hat - Boston$medv)^2)
[1] 21.89483
Some things we have not considered so far, but that can be tried: interaction terms and non-linear transformations of the predictors.
fit5 <- lm(medv ~ lstat * age, Boston) # * includes the main effects and their interaction
summary(fit5)
Call:
lm(formula = medv ~ lstat * age, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.806 -4.045 -1.333 2.085 27.552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
age -0.0007209 0.0198792 -0.036 0.9711
lstat:age 0.0041560 0.0018518 2.244 0.0252 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
fit6 <- lm(medv ~ lstat + I(lstat^2), Boston) # need to have I() to protect the square term
summary(fit6)
Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.2834 -3.8313 -0.5295 2.3095 25.4148
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.862007 0.872084 49.15 <2e-16 ***
lstat -2.332821 0.123803 -18.84 <2e-16 ***
I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
fit7 <- lm(medv ~ poly(lstat, 4, raw = T), Boston)
summary(fit7)
Call:
lm(formula = medv ~ poly(lstat, 4, raw = T), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-13.563 -3.180 -0.632 2.283 27.181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.731e+01 2.280e+00 25.134 < 2e-16 ***
poly(lstat, 4, raw = T)1 -7.028e+00 7.308e-01 -9.618 < 2e-16 ***
poly(lstat, 4, raw = T)2 4.955e-01 7.489e-02 6.616 9.50e-11 ***
poly(lstat, 4, raw = T)3 -1.631e-02 2.994e-03 -5.448 7.98e-08 ***
poly(lstat, 4, raw = T)4 1.949e-04 4.043e-05 4.820 1.90e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.28 on 501 degrees of freedom
Multiple R-squared: 0.673, Adjusted R-squared: 0.6704
F-statistic: 257.8 on 4 and 501 DF, p-value: < 2.2e-16
xgrid <- seq(min(Boston$lstat), max(Boston$lstat), length.out = 100)
y_hat6 <- as.matrix(model.matrix(~ xgrid + I(xgrid^2))) %*% coef(fit6)
y_hat7 <- as.matrix(model.matrix(~ poly(xgrid, 4, raw = T))) %*% coef(fit7)
par(mfrow = c(1,1))
plot(Boston$lstat, Boston$medv, pch = 16, xlab = 'lstat', ylab = 'medv')
lines(xgrid, y_hat6, col = cols[1], lwd = 3)
lines(xgrid, y_hat7, col = cols[2], lwd = 3)
legend('topright', col = cols[1:2], legend = c('Quadratic','Fourth degree'), lwd = 3)
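Since the quadratic model is nested in the fourth-degree one, a natural way to compare them is the F test for nested models; a sketch:
anova(fit6, fit7)   # does the fourth-degree polynomial significantly improve on the quadratic fit?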
The file greenmoving.txt contains travel times (in minutes) as a function of the length of the route (in meters) and of the means of transportation used (walking, cycling, driving).
The means of transportation is a categorical variable with three levels: we need two dummy variables.
green <- read.table('./Data/greenmoving.txt', header=T)
# Let us transform the transport variable in a factor!
green$transport <- factor(green$transport)
walk_idx <- as.numeric(green$transport == 'walk')
bike_idx <- as.numeric(green$transport == 'bike')
car_idx <- as.numeric(green$transport == 'car')
ggplot() +
geom_point(aes(x = length, y = time, col = transport), data = green) +
labs(x = 'distance', y = 'time') +
scale_color_brewer(name = 'Transport', palette = 'Set1')
The reference level is going to be bike (the first level in alphabetical order)!
\[time = \beta_0 + \beta_1*car + \beta_2*walk + \beta_3*length + \beta_4*length*car + \beta_5*length*walk + \varepsilon\]
This is a model with interactions! Therefore we have a slope and an intercept for every level of transport.
Bike regression line: \(\beta_0 + \beta_3*length + \varepsilon\)
Car regression line: \((\beta_0 + \beta_1) + (\beta_3 + \beta_4)*length + \varepsilon\)
Walk regression line: \((\beta_0 + \beta_2) + (\beta_3 + \beta_5)*length + \varepsilon\)
fit <- lm(time ~ car_idx + walk_idx + length + length:car_idx + length:walk_idx, data = green)
summary(fit)
Call:
lm(formula = time ~ car_idx + walk_idx + length + length:car_idx +
length:walk_idx, data = green)
Residuals:
Min 1Q Median 3Q Max
-8.0179 -2.0579 0.0629 2.1336 6.3213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5667208 1.3487377 3.386 0.000915 ***
car_idx 1.2107320 1.6665020 0.727 0.468705
walk_idx -1.1597079 1.9965712 -0.581 0.562250
length 0.0052521 0.0013307 3.947 0.000123 ***
car_idx:length -0.0002884 0.0013484 -0.214 0.830913
walk_idx:length 0.0065389 0.0018779 3.482 0.000659 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.135 on 144 degrees of freedom
Multiple R-squared: 0.899, Adjusted R-squared: 0.8955
F-statistic: 256.4 on 5 and 144 DF, p-value: < 2.2e-16
ggplot() +
geom_point(aes(x = length, y = time, col = transport), data = green) +
geom_abline(intercept = sum(coef(fit)[1]), slope = sum(coef(fit)[4]), col = cols[1]) +
geom_abline(intercept = sum(coef(fit)[1:2]), slope = sum(coef(fit)[4:5]), col = cols[2]) +
geom_abline(intercept = sum(coef(fit)[c(1,3)]), slope = sum(coef(fit)[c(4,6)]), col = cols[3]) +
labs(x = 'distance', y = 'time') +
scale_color_brewer(name = 'Transport', palette = 'Set1')
There is a much better way of doing this. Instead of creating the dummy variables by hand, we let R take care of it: we only have to declare that the variable transport is a factor!
fit <- lm(time ~ transport + length + length:transport, data = green)
summary(fit)
Call:
lm(formula = time ~ transport + length + length:transport, data = green)
Residuals:
Min 1Q Median 3Q Max
-8.0179 -2.0579 0.0629 2.1336 6.3213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5667208 1.3487377 3.386 0.000915 ***
transportcar 1.2107320 1.6665020 0.727 0.468705
transportwalk -1.1597079 1.9965712 -0.581 0.562250
length 0.0052521 0.0013307 3.947 0.000123 ***
transportcar:length -0.0002884 0.0013484 -0.214 0.830913
transportwalk:length 0.0065389 0.0018779 3.482 0.000659 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.135 on 144 degrees of freedom
Multiple R-squared: 0.899, Adjusted R-squared: 0.8955
F-statistic: 256.4 on 5 and 144 DF, p-value: < 2.2e-16
Some dummy variables do not seem to be significant. Let us verify it in a more precise way.
The means of transportation is significant if AT LEAST one of the coefficients involving the dummy variables is significantly different from 0, i.e. the null hypothesis is \[\beta_{1} = \beta_{2} = \beta_{4} = \beta_{5} = 0.\] We encode this hypothesis as a matrix whose rows select the corresponding coefficients, in the order in which they appear in coef(fit):
rbind(c(0,1,0,0,0,0),
c(0,0,1,0,0,0),
c(0,0,0,0,1,0),
c(0,0,0,0,0,1))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 1 0 0 0 0
[2,] 0 0 1 0 0 0
[3,] 0 0 0 0 1 0
[4,] 0 0 0 0 0 1
linearHypothesis(fit,
rbind(c(0,1,0,0,0,0),
c(0,0,1,0,0,0),
c(0,0,0,0,1,0),
c(0,0,0,0,0,1)),
c(0,0,0,0))
Linear hypothesis test
Hypothesis:
transportcar = 0
transportwalk = 0
transportcar:length = 0
transportwalk:length = 0
Model 1: restricted model
Model 2: time ~ transport + length + length:transport
Res.Df RSS Df Sum of Sq F Pr(>F)
1 148 2606.3
2 144 1415.0 4 1191.3 30.309 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus, it appears that the transportation does influence the response.
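Equivalently (a sketch), we could fit the restricted model implied by the null hypothesis and compare the two nested models with anova(), which performs the same F test:
fit_restricted <- lm(time ~ length, data = green)   # model under H0: no effect of transport
anova(fit_restricted, fit)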
The length of the route is significant if AT LEAST one of the coefficients where it appears is significantly different from 0, i.e. the null hypothesis is \[\beta_{3} = \beta_{4} = \beta_{5} = 0.\]
linearHypothesis(fit,
rbind(c(0,0,0,1,0,0),
c(0,0,0,0,1,0),
c(0,0,0,0,0,1)),
c(0,0,0))
Linear hypothesis test
Hypothesis:
length = 0
transportcar:length = 0
transportwalk:length = 0
Model 1: restricted model
Model 2: time ~ transport + length + length:transport
Res.Df RSS Df Sum of Sq F Pr(>F)
1 147 7452.2
2 144 1415.0 3 6037.2 204.8 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Thus, it appears that the length of the route does influence the response.
Let’s look at the p-values for the individual tests!
summary(fit)
Call:
lm(formula = time ~ transport + length + length:transport, data = green)
Residuals:
Min 1Q Median 3Q Max
-8.0179 -2.0579 0.0629 2.1336 6.3213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.5667208 1.3487377 3.386 0.000915 ***
transportcar 1.2107320 1.6665020 0.727 0.468705
transportwalk -1.1597079 1.9965712 -0.581 0.562250
length 0.0052521 0.0013307 3.947 0.000123 ***
transportcar:length -0.0002884 0.0013484 -0.214 0.830913
transportwalk:length 0.0065389 0.0018779 3.482 0.000659 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.135 on 144 degrees of freedom
Multiple R-squared: 0.899, Adjusted R-squared: 0.8955
F-statistic: 256.4 on 5 and 144 DF, p-value: < 2.2e-16
The summary suggests a single intercept and only two distinct slopes (walk vs. the other means of transportation): the coefficients of the car dummy, the walk dummy and the car:length interaction are not significant. Let us test whether we can remove these three coefficients as a block, using the null hypothesis \[\beta_{1} = \beta_{2} = \beta_{4} = 0.\]
linearHypothesis(fit,
rbind(c(0,1,0,0,0,0),
c(0,0,1,0,0,0),
c(0,0,0,0,1,0)),
c(0,0,0))
The resulting F statistic is small (the restricted model explains essentially the same amount of variability as the full one), so we fail to reject the null hypothesis.
We can reduce the model to \[time = \beta_0 + \beta_1*length + \beta_2*length*walk + \varepsilon\]
fit2 <- lm(time ~ length + length:walk_idx, data = green)
summary(fit2)
Call:
lm(formula = time ~ length + length:walk_idx, data = green)
Residuals:
Min 1Q Median 3Q Max
-8.1640 -1.9820 0.0539 2.0756 6.4508
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.7579758 0.4581664 10.38 <2e-16 ***
length 0.0051614 0.0001438 35.90 <2e-16 ***
length:walk_idx 0.0054701 0.0004992 10.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.124 on 147 degrees of freedom
Multiple R-squared: 0.8976, Adjusted R-squared: 0.8962
F-statistic: 644.5 on 2 and 147 DF, p-value: < 2.2e-16
The three regression lines have the same intercept: 4.758.
The bike regression line and the car regression line have the same slope: 0.005.
The walk regression line has slope 0.0051614 + 0.0054701 ≈ 0.0106.
Car and bike have the same regression line.
ggplot() +
geom_point(aes(x = length, y = time, col = transport), data = green) +
geom_abline(intercept = sum(coef(fit2)[1]), slope = sum(coef(fit2)[2]), col = 'black') +
geom_abline(intercept = sum(coef(fit2)[1]), slope = sum(coef(fit2)[2:3]), col = cols[3]) +
labs(x = 'distance', y = 'time') +
scale_color_brewer(name = 'Transport', palette = 'Set1')
The file smog.txt reports the concentrations of PM10 recorded by some weather stations in a town in Alaska between February 1 and February 10, 2010. The mayor of the town forbade the use of domestic heaters starting from February 7. The city council requests a statistical analysis to study the efficacy of the ban on heaters.
We want to estimate the parameters of the following candidate models (\(t = 1\) denotes February 1): a single linear trend over the whole period, a piecewise-linear trend that is continuous at the start of the ban but changes slope, and a piecewise-linear trend that also allows a jump at the start of the ban.
smog <- read.table('./Data/smog.txt', header=T)
head(smog)
day station PM10
1 1 1 41.605
2 1 2 41.622
3 1 3 43.427
4 1 4 42.276
5 1 5 42.583
6 2 1 44.219
Let us try and write the possible models:
intervention <- ifelse(smog$day > 6.5, 1, 0)
par(mar=c(4,4,2,2), family = 'serif')
plot(smog$day, smog$PM10, pch = 16, col = alpha('black', 0.5),
xlab = 'day', ylab = 'PM10')
abline(v = 6.5, lty = 3)
fit1 <- lm(PM10 ~ day, smog)
fit2 <- lm(PM10 ~ day + I((day - 6.5) * intervention), smog)
fit3 <- lm(PM10 ~ day + intervention + I((day - 6.5) * intervention), smog)
summary(fit1)
Call:
lm(formula = PM10 ~ day, data = smog)
Residuals:
Min 1Q Median 3Q Max
-2.4600 -0.8402 -0.1131 0.6881 3.5109
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.14209 0.38515 104.22 <2e-16 ***
day 2.07188 0.06207 33.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.261 on 48 degrees of freedom
Multiple R-squared: 0.9587, Adjusted R-squared: 0.9578
F-statistic: 1114 on 1 and 48 DF, p-value: < 2.2e-16
summary(fit2)
Call:
lm(formula = PM10 ~ day + I((day - 6.5) * intervention), data = smog)
Residuals:
Min 1Q Median 3Q Max
-2.2468 -1.0178 0.0062 0.7391 3.4297
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.6276 0.4811 82.364 <2e-16 ***
day 2.2314 0.1107 20.150 <2e-16 ***
I((day - 6.5) * intervention) -0.4539 0.2632 -1.724 0.0912 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 47 degrees of freedom
Multiple R-squared: 0.9612, Adjusted R-squared: 0.9595
F-statistic: 581.5 on 2 and 47 DF, p-value: < 2.2e-16
summary(fit3)
Call:
lm(formula = PM10 ~ day + intervention + I((day - 6.5) * intervention),
data = smog)
Residuals:
Min 1Q Median 3Q Max
-2.33462 -0.67789 -0.03462 0.67975 2.63922
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.3889 0.4120 98.038 < 2e-16 ***
day 1.9303 0.1058 18.247 < 2e-16 ***
intervention 3.0407 0.5822 5.223 4.15e-06 ***
I((day - 6.5) * intervention) -0.8555 0.2244 -3.812 0.000408 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9895 on 46 degrees of freedom
Multiple R-squared: 0.9756, Adjusted R-squared: 0.974
F-statistic: 613.4 on 3 and 46 DF, p-value: < 2.2e-16
day_grid <- seq(min(smog$day), max(smog$day), length.out = 200)
intervention_grid <- ifelse(day_grid > 6.5, 1, 0)
lines(day_grid, predict(fit1, newdata = list(day = day_grid)), col = cols[1], pch = 16, lwd = 2)
lines(day_grid, predict(fit2, newdata = list(day = day_grid, intervention = intervention_grid)), col = cols[2], pch = 16, lwd = 2)
lines(day_grid, predict(fit3, newdata = list(day = day_grid, intervention = intervention_grid)), col = cols[3], pch = 16, lwd = 2)
legend('topleft', legend = c(TeX('$C^1$'),TeX('$C^0$'),TeX('$C^{-1}$')), col = cols[1:3], lwd = 2)
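Since the three models are nested (fit1 within fit2, fit2 within fit3), we could also compare them with a sequence of F tests; a sketch:
anova(fit1, fit2, fit3)   # sequential F tests for the nested models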
summary(fit2)
Call:
lm(formula = PM10 ~ day + I((day - 6.5) * intervention), data = smog)
Residuals:
Min 1Q Median 3Q Max
-2.2468 -1.0178 0.0062 0.7391 3.4297
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.6276 0.4811 82.364 <2e-16 ***
day 2.2314 0.1107 20.150 <2e-16 ***
I((day - 6.5) * intervention) -0.4539 0.2632 -1.724 0.0912 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 47 degrees of freedom
Multiple R-squared: 0.9612, Adjusted R-squared: 0.9595
F-statistic: 581.5 on 2 and 47 DF, p-value: < 2.2e-16
Remember the model. Suppose we want to test whether, after the intervention, PM10 grows at half the rate it had before: this corresponds to the hypothesis \(\beta_1 + \beta_2 \text{ (slope after intervention)} = \frac{1}{2} \beta_1 \text{ (slope before intervention)}\), i.e. to \(H_0: \beta_1 + 2 \beta_2 = 0\).
linearHypothesis(fit2,
rbind(c(0,1,2)),
0)
Linear hypothesis test
Hypothesis:
day + 2 I((day - 6.5) * intervention) = 0
Model 1: restricted model
Model 2: PM10 ~ day + I((day - 6.5) * intervention)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 48 85.678
2 47 71.749 1 13.929 9.124 0.004071 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also calculate the confidence interval for a linear combination of the parameters:
library(multcomp)
hyp_matrix <- array(0, dim = c(2, length(coef(fit2)))) # 2 is the number of CI to compute, length(coef(fit2)) is the model dimension
hyp_matrix[1,] <- c(0,1,0)
hyp_matrix[2,] <- c(0,1,1)
confcm <- glht(fit2, linfct = hyp_matrix)
confcm <- confint(confcm)
plotCI(x = 1:2, y = confcm$confint[,1], ui = confcm$confint[,3],
li = confcm$confint[,2], pch = 21, lwd = 2, xaxt = 'n', xlab = '',
ylab = 'value', col = cols[1])
axis(1, at = 1:2, labels = c(TeX('$\\beta_1$'), TeX('$\\beta_1 + \\beta_2$')))